HPC Project: CTC loss for RNN speech recognition
نویسندگان
چکیده
One of the major challenges in speech recogntion or any other field, that concerns itself with structured predictions, is the alginment of two different sequences. The training data for training an RNN is a set of utterances, consisting of audio recorded via a regular microphone and the transcription of the spoken words. This transcription may either be in terms of phonemes or characters. It is clear, that given a fixed length of input audio signal, the number of spoken words within that signal may vary. Therefore, in constructing a function that maps from audio to transcription, such as a neural network, an essential challenge is to align the audio with transitions between characters.
منابع مشابه
Multitask Learning with CTC and Segmental CRF for Speech Recognition
Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling methods used for end-to-end training of speech recognition models. Both models define a transcription probability by marginalizing decisions about latent segmentation alternatives to derive a sequence probability: the former uses a globally normalized joint model of segment labe...
متن کاملAudio Visual Speech Recognition Using Deep Recurrent Neural Networks
In this work, we propose a training algorithm for an audiovisual automatic speech recognition (AV-ASR) system using deep recurrent neural network (RNN).First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual featu...
متن کاملRobust end-to-end deep audiovisual speech recognition
Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition however has not yet found wide-spread use, mostly because the temporal alignment an...
متن کاملSpoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB) as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIBchr('39')s archive is one of the richest archives in Iran containing a huge amount of multimedia data. Monitoring this massive volume of data, and brows and retrieval of this archive is one of the key issues for this broadcasting...
متن کاملSequence to Sequence Training of CTC-RNNs with Partial Windowing
Connectionist temporal classification (CTC) based supervised sequence training of recurrent neural networks (RNNs) has shown great success in many machine learning areas including endto-end speech and handwritten character recognition. For the CTC training, however, it is required to unroll (or unfold) the RNN by the length of an input sequence. This unrolling requires a lot of memory and hinde...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015